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This paper presents a performance model of the ATA/ATAPI SSD Trim command under various types of user 
workloads, including a uniform random workload, a workload with hot and cold data, and a workload with N 
temperatures of data. We first examine the Trim-modified uniform random workload to predict utilization, 
then use this result to compute the resultant level of effective overprovisioning. This allows modification of 
models previously suggested to predict write amplification of a non-Trim uniform random workload under 
greedy garbage collection. Finally, we expand the theory to cover a workload consisting of hot and cold data 
bJ} (and also N temperatures of data), providing formulas to predict write amplification in these scenarios. 
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1. INTRODUCTION 

NAND flash nonvolatile memory has become ubiquitous in electronic devices. In its 
form as a solid-state drive (SSD), it is often used as a replacement for the traditional 
hard disk drive, due to its faster performance. However, flash devices have some draw- 
backs, notably a performance slowdown when the device is filled with data, and a 
limited number of program/erase cycles before device failure. 

The reason flash-based SSDs slow as they fill with data is write amplification, in 
which more writes are performed on the device than are requested. This is due to 
the dynamic nature of the logical-to-physical mapping of data in conjunction with the 
erase-before-write requirement of NAND flash. A more detailed explanation of write 
amplification and its causes is in Section 2. 

Manufacturers use several techniques to reduce write amplification in an effort to 
mitigate the negative performance impact. Two of these techniques are physically over- 
provisioning the device with extra storage, and implementing the Trim command [Tri 
2011]. More information about these techniques is in section 2. 
The main contributions of this work can be summarized as follows. 
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— A Markovian birth-death model of random workloads, previously described by 
Frankie et al. [2012a] is shown, along with a new analysis of higher order terms. 

— The write amplification of the Trim-modified random workload, previously described 
by Frankie et al. [2012b], is extended to hot and cold data. 

2. A NAND FLASH PRIMER 

Some familiarity with NAND flash is required to understand the models presented 
in this paper. In this section, we summarize the necessary NAND flash background 
material for the reader. 

2.1. Flash Layout 

NAND flash is composed of pages, which are grouped together in physical blocks 1 . 
Pages are the smallest write unit, and hold a fixed amount of data, the size of which 
varies by manufacturer, and is typically 2K, 4K, or 8K in size [Grupp et al. 2009; 
Roberts et al. 2009]. Unlike with traditional magnetic media, pages cannot be directly 
overwritten, but must first be erased before they can be programmed again, a require- 
ment dubbed erase-before-write. Blocks are the smallest erase unit, and typically con- 
tain 64, 128, or 256 pages, again varying by manufacturer [Grupp et al. 2009]. 

Because of the erase-before-write requirement and the mismatch in sizes between 
the smallest write unit (pages) and the smallest erase unit (blocks), write-in-place 
schemes are impractical in flash devices. To overcome this issue, most flash devices 
employ a dynamic logical-to-physical mapping called the Flash Translation Layer 
(FTL) [Chung et al. 2009; Gupta et al. 2009]. There are many FTLs discussed in the 
literature; a survey and taxonomy of these is found in [Chung et al. 2009]. In this pa- 
per, we employ a log-structured SSD, as described by Hu et al. [2009], in which the 
data associated with a write request is written to the next available pages. If this data 
was intended to overwrite previously recorded data, the pages on which the data was 
previously stored are invalidated in the FTL mapping. 

2.2. Garbage Collection 

At some point in the process of writing data, most of the pages will be filled with either 
valid or invalidated data, and more space is needed for writing fresh data. Garbage 
collection occurs in order to reclaim additional blocks. In this process, a victim block 
is chosen for reclamation, its valid pages are copied elsewhere to avoid data loss, then 
the entire block is erased and the FTL map pointer is changed. Many algorithms exist 
for choosing the victim block, most of which are based on either selecting the block that 
creates the most free space or one that also takes into account the age of the data on 
the block, as originally described in the seminal paper by Rosenblum and Ousterhout 
[1992]. In this paper, we utilize the greedy garbage collection method: when the num- 
ber of free (erased) blocks in the reserve-queue drops below a predetermined thresh- 
old, r, the block with the fewest number of valid pages is chosen for reclamation, as 
explained in Algorithm 1. 

2.3. Write Amplification 

During garbage collection, extra page writes not directly requested of the SSD occur. 
These extra writes contribute to a phenomenon known as write amplification. Write 
amplification, measured as the ratio of the average number of actual writes to re- 
quested writes [Hu et al. 2009], slows the overall write speed of the device as it fills 



1 Note that physical blocks are different than logical blocks, and the type referred to when the word "block" 
is used needs to be determined from the context. In this paper, we attempt to specify logical or physical block 
if the context is potentially unclear. 
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ALGORITHM 1: Data Placement and Greedy Garbage Collection 

Input: LBA requests of types Write and Trim. 
Output: Data placement and garbage collection procedures, 
while there are LBA requests do 
if request is Write then 

write new data to first free page in current-block; 
if LBA written was previously In-Use then 

invalidate appropriate page in FTL mapping; 
end 

if sizeifree pages in current -block) = then 

append current-block to end of occupied-block-queue; 
current-block first block in reserve-queue; 
if size(reserve-queue)< r then 
// garbage collection: 

victim-block <— block with fewest valid pages from occupied-block-queue; 
copy valid pages from victim-block to current-block; 
erase victim-block and append to reserve-queue; 
end 
end 
else 

// request was Trim 

invalidate appropriate page in FTL mapping; 
end 
end 



with data and reduces the lifetime of the device, since each page has a limited number 
of program/erase cycles before it wears out and can no longer be used to store data. 

Most FTLs and garbage collection algorithms attempt to reduce write amplification 
and perform wear leveling, attempting to keep the number of program/erase cycles 
uniform over all blocks. Greedy garbage collection has been proven optimal for random 
workloads in terms of write amplification reduction [Hu and Haas 2010]. 

2.3.1. Overprovisioning. In addition to FTLs and garbage collection methods that at- 
tempt to work well with various workload, manufacturers use other ways to reduce 
write amplification. One of these methods is through physically overprovisioning the 
flash device by providing more physical storage than the user is allowed to logically 
access. Overprovisioning reduces write amplification by increasing the amount of time 
between garbage collections of the same block, which in turn increases the probability 
that more pages in the block are invalid at the time of garbage collection, and thus need 
not be copied. The mathematical relationship between overprovisioning and write am- 
plification in uniform random workloads has been studied by Hu et al. [2009] and Hu 
and Haas [2010], Agarwal and Marrow [2010], and Xiang and Kurkoski [2012]. 

In this paper, we measure the level of overprovisioning using the spare factor: Sf = 
(t — u)/t. The spare factor has a range of [0,1], making comparisons of various spare 
factors intuitive. 

2.3.2. Trim. Another way that write amplification can be reduced is through use of 
the Trim command, defined in the working draft of the ATA/ATAPI Command Set - 2 
(ACS-2) [Tri 2011]. The Trim command allows the file system to communicate to the 
SSD that data is no longer needed, and can be treated as invalidated. This means that 
the data in these pages does not need to be retained during garbage collection, reducing 
the number of extra writes that garbage collection generates. As described by Frankie 
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et al. [2012a], and detailed in section 4, the Trim command effectively increases the 
amount of overprovisioning on the SSD. 



3. WORKLOADS 

There are many different workloads seen in the real world, from a single user of a 
personal computer, to multiple users of cloud computing, database transactions, se- 
quential serial access, and random page access. However, workloads can be seen as 
existing along a spectrum from purely sequential requests to completely random re- 
quests, with most real-world workloads falling between the two extremes. Purely se- 
quential workloads are not very interesting to study in conjunction with write ampli- 
fication in SSDs, because by nature of being sequential, entire blocks are invalidated 
as requests continue. This means that at garbage collection, there is no valid data that 
needs to be saved in the pages of the block being reclaimed, and so there are no extra 
writes to contribute to write amplification [Hu et al. 2009]. On the other hand, random 
workloads cause considerable write amplification; the effect on write amplification of 
uniform random workloads without Trim has been studied by Hu et al. [2009] and Hu 
and Haas [2010], Agarwal and Marrow [2010], and Xiang and Kurkoski [2012]. The ef- 
fect on write amplification of uniform random workloads with Trim has been analyzed 
by Frankie et al. [2012b]. 

For the study of write amplification, read requests are irrelevant, and so are not 
considered in the uniform random workload. The standard uniform random workload 
studied by Hu et al. [2009] and Hu and Haas [2010], Agarwal and Marrow [2010], 
and Xiang and Kurkoski [2012] consists only of Write requests that are uniformly 
randomly chosen over all u user logical block addresses (LBAs). To simplify the math 
involved in the analysis, it is assumed that the data associated with a write request 
to one LBA will occupy one physical page. Frankie et al. [2012b] introduced Trim re- 
quests to the standard uniform random workload. In this Trim-modified uniform ran- 
dom workload, requests are split between Writes and Trims, with Trims occurring with 
probability q. Write requests are uniformly random over all u user LBAs, and Trim re- 
quests are uniformly random over all In-Use LBAs. Frankie et al. [2012b] define an 
In-Use LBA as one whose most recent request was a Write; all other LBAs, includ- 
ing those whose most recent request was a Trim, or that had never been written, are 
considered not In-Use. 



4. A MODEL FOR TRIM 

We have previously shown that a Trim-modified uniform random workload can be mod- 
eled as a Markov birth-death chain [Frankie et al. 2012a]. We summarize the results 
here, showing the necessary details in order to also perform a higher-order analysis. 
For the convenience of the reader, a list of variables and their descriptions used in the 
paper is given in Table I. 



4.1. Trim-Modified Uniform Random Workload as a Markov Birth-Death Chain 

The Trim-modified uniform random workload is modeled as a Markov birth-death 
chain in which the number of In-Use LBAs at time n is the state, X n . Write requests 
can either increase the state by 1 or leave it unchanged, and Trim requests reduce 
the state by 1. The problem-specific transition probabilities of the Markov chain are as 
follows: 
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Table I. List of variables and their descriptions 



Variable 


Description 


t 


Number of pages on SSD 


u 


Number of pages user is allowed to fill 


n p 


Number of pages in a block 


r 


Minimum number of reserved blocks in reserve-queue 


S f = (t- u)/t 


Manufacturer-specified spare factor 


7~, 1, / 

p = [t — u)/u 


Alternate measure of overprovisioning 


Q 


Probability of Trim request 




Unnormatized steady-state probabilities 


s = (1 - 2q)/{l - q) 


Steady-state probability of a Write being for an In-Use LBA 


s = q/{l-q) 


Steady-state probability of a Write being for a not In-Use LBA 


S eS ={t-X n )/t 


Effective spare factor 


o u — Seff - 1+p 1 
Peff 1-S pff s 1 


Alternate measure of effective overprovisioning 


u eB = us 


Mean number of In-Use LBAs at steady-state 


a z = us 


Variance of the number of In-Use LBAs at steady-state 


fhJc 


Fraction of the data that is hot or cold 


Ph,Pc 


Probability of requesting hot or cold data 



P(X n + 1 = X - 1\X; 

P(X n+1 = x\X, 
P(X n+1 = x + 1\X, 

where q x is the probability of a Trim request, r x is the probability of a Write request for 
an In-Use LBA, and p x is the probability of a Write request for a not In-Use LBA. Note 
that the value of q x is a constant q regardless of the state, but r x and p x are dependent 
on the number of In-Use LBAs. 

The steady-state occupation of a Markov birth-death chain is known to be [Hoel et al. 
1971] 



-^unnormalized (^steady % 

Substituting for the problem-specific values, we find 

n x =(— q -X "' Vxe{o,..., w } (l) 

\ q J u x (u — xjl 

From an inspection of Equation 1, it is not straightforward to understand the effect of 
the values of u and q on the distribution. However, this equation can be manipulated 
into a more comprehensible form, as shown in the next subsection. 



x) =q x = q 



x) 



Px = 



(1-9) 



Ptr--Px~i 



X ^ 1 
X = 



4.2. Understanding the Steady-State Occupation Probabilities 

In this section, we manipulate Equation 1 into a more familiar form, while maintain- 
ing asymptotic equivalence to the normalization factor (denoted with ~). First, define 

s = f-j-q^J an( i s = 1 — s = f]4g)> noting that s + s = 1. With these substitutions, 

Equation 1 becomes 
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The exponential terms suggest that working in log-space may be easier: 

log {ir x ) ~ -log((u-ar)!) - a; log (us) 

At this point, it is useful to apply Stirling's formula for factorials: log (n!) pa nlog (n) — 
n, using sa to denote an asymptotic approximation. It is well known that Stirling's 
approximation for factorials converges very quickly, and for relatively small values of 
n can practically be considered to provide a "ground truth" equivalent for the value of 
the factorial function 2 . The application of Stirling's approximation to this problem is 
reasonable. For SSDs of several GB, u is on the order of hundreds of thousands, so the 
Stirling approximation of (u — x)\ is effectively ground truth except for a few x very 
close in value to u. 

log (tt x ) pa — (u — x) log (u — x) + (u — x) — x log (us) 

(u - x) log {^—f-^ + {u - x) (2) 

A Taylor series expansion is a natural way to handle the log term. First the definition 
er 2 = us and change of variables 



us 

helps to put the log term into standard form to apply the Taylor series expansion 



log(l - a) = — y~* — for - 1 < a < 1 
^-^ n 

n=l 

Then, 

log (ir x ) Pa -cr 2 (l - a) log(l - a) + cr 2 (l - a) 

cr 2 (l - a) 





-a) 




{ r 


-°*\ 




a 2 




2 





a'- 
n 




a 2 (l-a) 



n 2 



^-)+a 2 (l- Q ) 



— (x — us) 2 
2us 



(3) 



2 Stirling's Formula, one of several possible Stirling approximations to the factorial function, provides a 
highly accurate approximation to nl for values of n as small as 10-to-100. For this reason, for even moder- 
ately sized values of n, the Stirling Formula is used to derive a variety of useful probability distributions in 
fields such as probability theory [Feller 1968], equilibrium statistical mechanics [Reif 1965], and informa- 
tion theory [Cover and Thomas 2006]. Because of the remarkable accuracy of the Stirling Approximation, 
which very rapidly increases with the size of n, the resulting distributions can be taken to be "ground truth" 
distributions in many practical domains. Derivations of the Stirling Approximation, and discussions of its 
accuracy and convergence, can be found in pages 52-54 of [Feller 1968], Appendix A.6 of [Reif 1965], and 
pages 522-524 of [Courant and Hilbert 1953]. 
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and we see that the Taylor series first-order approximation of tt x is an unnormalized 
Gaussian random variable with mean us and variance us. 3 

4.3. Higher Order Terms 

The analysis above allows us to compute higher-order moments such as skewness and 
kurtosis by truncating the Taylor series expansion at later terms. The skewness is 
calculated as 4 

Skew x = -I+ijf-lj W 
and the kurtosis is calculated as 

Kurtosis^ = — ^> + R ( —r) (5) 

4er 2 V / 

For the details of the derivation of the skewness and kurtosis, see [Frankie 2012]. 

The slight negative skew is toward a higher level of effective overprovisioning. This 
is more evident for small u with a narrow variance (low but non-zero q). Fig. 2 shows 
simulation, exact analytic values, and approximate Gaussian values for very small 
u = 25. With this small of a u, the slight negative skew is evident. It is unlikely that 
such a low value of u would every be encountered in a real device; the higher-order 
analysis is included here for completeness. 

A more detailed asymptotic analysis given in [Frankie 2012], shows that even in the 
limit of large u there is an irreducible, residual absolute error in the location of the 
mean of 0.5 which, consistent with the theory of asymptotic analysis, is asymptotically 
relatively negligible lim^oo O.b/u — 0. We correct for this known residual error in the 
approximation models via a shift, which results in the shown fits in Fig. 1 and Fig. 2. 

4.4. Effective Overprovisioning from Trim 

We measure the level of effective overprovisioning through the effective spare factor, 
defined as 

The mean and variance of the effective spare factor are calculated as 

t — us 



S e ff — 



t 

s + sS 



and 



/ 



Var(S eff ) = ivar(X„ 

us 

_ 8(1 - Sf) 

t 



3 For proof of convergence of the Taylor series within the range needed for the problem, please refer to 
[Frankie et al. 2012a]. 

4 -R y~s) i s defined to mean that the remaining terms are at least as small as O i n terms of the "big 

O" terminology. 
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Note that the mean effective spare factor is dependent only on the manufacturer- 
specified spare factor and the amount of Trim in the workload. However, the variance 
of the effective spare factor also depends on the absolute size of the device, and tends 
toward zero as the device size gets large. This means that for large enough SSDs, the 
penalty for ignoring the variance of the effective spare factor is negligible. 

4.5. Simulation Results and Practical Application 

Fig. 1 demonstrates the accuracy of the Stirling and Gaussian approximations for a 
very low u = 1000 and q = 0.4. The low value of u and high value of q were necessary in 
order to keep the graph from looking like a delta function, which is what a zoomed-out 
pdf of larger, more practically realistic, values of u appears to be. Even at such a low 
value of u, the simulation results, exact analytic pdf (Equation 1), Stirling approxima- 
tion of the pdf (Equation 2), and the Gaussian approximation of the pdf (Equation 3) 
are indistinguishable. In order to see the slight skew predicted by higher order terms, 
we are forced to use an unrealistically low value of u = 25, as seen in Fig. 2. However, 
even in this unrealistically small example, both the Stirling and Gaussian approxima- 
tions are close matches to the exact pdf. 

Fig. 3 illustrates the utilization of the user space as the percentage of In-Use LBAs 
at steady state for several rates of Trim. Higher rates of Trim correspond to lower 
utilization, but with a wide variance. The corresponding graph of the pdf of the steady 
state effective spare factor is found in Fig. 4, with the example utilizing a manufacturer 
specified spare factor of 0.11. Here, higher rates of Trim correspond to a higher effec- 
tive spare factor, but with a wide variance. Fig. 5 relates the manufacturer specified 
spare factor to the effective spare factor, showing both the mean and 3u standard devi- 
ations. In this figure, we see that a workload of 10% Trim requests transforms an SSD 
with a manufacturer-specified spare factor of zero into one with an effective spare fac- 
tor of 0.11. To obtain this spare factor through physical overprovisioning alone would 
require 12.5% more physical space than the user is allowed to access! Implementing 
the Trim command on an SSD allows for comparable performance without requiring 
extra physical materials, which lowers the price of making the SSD. Understanding 
the relationship between the workload and the performance of the device is necessary 
for consumers to be able to accurately compare SSD products; the theory in this paper 
sets out a framework to model this relationship. 

5. THE EFFECT OF TRIM ON WRITE AMPLIFICATION 

Hu et al. [2009] and Hu and Haas [2010], Agarwal and Marrow [2010], and Xiang 
and Kurkoski [2012] have each derived an analytic approach for computing the write 
amplification under the standard uniform random workload. We have previously pub- 
lished [Frankie et al. 2012b] an analysis of the Trim-modified uniform random work- 
load that parallels the derivation by Xiang and Kurkoski [2012]. 

5.1. Analysis of the Standard Uniform Random Workload 

5.1.1. Hu's Approach. Hu et al. [2009] and Hu and Haas [2010] utilize an empirical 
model to compute the write amplification of a purely random write workload under 
greedy garbage collection, noting that the value of parameters other than the device 
utilization do not substantially effect the write amplification. The equations needed to 
compute write amplification under their model are found in Fig. 6. 

5.1.2. Agarwal's Approach. Agarwal and Marrow [2010] probabilistically model the 
write amplification of a purely random write workload under greedy garbage collec- 
tion. They observed that the distribution of the number of blocks with v valid pages 
can be approximated as a uniform distribution, then equate the expected value of this 
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Fig. 1 . A histogram of one run of simulation results for u = 1000 and q = 0.4, along with plots of the 
exact analytic pdf (Equation 1) in red, the Stirling approximation of the pdf (Equation 2) in black dots, 
and the Gaussian approximation of the pdf (Equation 3) in green. The line for the exact analytic pdf is not 
even visible because the line for the Gaussian approximation of the pdf covers it completely. The Stirling 
approximation of the pdf (black dots) lines up well with the Gaussian approximation and the (covered) 
analytic pdf. Even at this low value of u = 1000, the theoretical approximations are effectively identical to 
the exact analytic pdf. 

uniform distribution with the expected value of the binomially distributed number of 
valid pages in a block chosen for garbage collection 5 . The resulting equation predicts 
write amplification as 

^Agarwal = ^ ( ~ ^ 

where p = (t — u)/u is a measure of the level of overprovisioning. 

5.1.3. Xiang's Approach. Xiang and Kurkoski [2012] improve the approach taken 
by Agarwal and Marrow [2010]. They probabilistically analyze the number of invalid 
pages freed by garbage collection, assuming that it reaches a steady state with con- 
stant expected value. Additionally, they compute the probability of a page being invalid 
in a block chosen for garbage collection, then utilize the fact that all pages in the block 



5 The number of valid pages in a block chosen for garbage collection is binomially distributed due to the 
random nature of the write requests. 
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Comparison of Analytic and Simulation pdf 
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Fig. 2. A histogram of one run of simulation results for u = 25 and q = 0.3, along with plots of the 
exact analytic pdf (Equation 1) in red, the Stirling approximation of the pdf (Equation 2) in black, and 
the Gaussian approximation of the pdf (Equation 3) in green. A slight negative skew of the exact analytic 
pdf is evident when compared to the Gaussian approximation. Note that the exact analytic pdf is the best 
fit of the histogram. A series of 64 Monte-Carlo simulations with these values result in an average skew 
of mean ± one standard deviation = —0.299 ± 0.004, which agrees well with the theoretically computed 
approximation value of —0.306 ± R (^sj = -0-306 ± iJ(0.029). The average simulation kurtosis is 0.064; 

the approximate theoretical is 0.070, both values well within overlapping error bounds. The top graph is 
a plot of the numerically normalized equations as shown in the text. In the bottom graph, a half-bin shift 
correction has been made to the Stirling and Gaussian approximation densities as described in the text. 

are equally likely to be invalid to compute the binomially distributed expected number 
of invalid pages to be freed by garbage collection. Equating these values, then taking 
the limit as the device becomes large, they find the write amplification is 



.4 



Xiang 



-i-P 



-1 - p -W((-l - p)eS- 1 -P)) 



(10) 



where W(- 
2012]. 



denotes the Lambert W function [Corless et al. 1996; Xiang and Kurkoski 



5.2. Analysis of the Trim-Modified Uniform Random Workload 

5.2.1. Paralleling Xiang's Approach. In [Frankie et al. 2012b], we added Trim commands 
to the purely random write workload analyzed by Hu et al. [2009] and Hu and Haas 
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Fig. 3. Theoretical results for utilization under various amounts of Trim. When more Trim requests are in 
the workload, a lower percentage of LBAs are In-Use at steady state, but the variance is higher than seen 
in a workload with fewer Trim requests. 



[2010], Agarwal and Marrow [2010], and Xiang and Kurkoski [2012], then paralleled 
the derivation of Xiang and Kurkoski [2012] to compute the write amplification under 
greedy garbage collection, finding 6 



-(i+p) 



4<T) 

^Xiang 



-(I+P) 



w 



-(1+p). 



-(1+p) 



(11) 



5.2.2. Modification of Equations for Hu and Agarwal Approaches. A comparison of ^xi'ang an< ^ 
^4xiang reveals that they are the same equation, but with 



PeS 



Sefi 
1 - S e f[ 



l + p 



(12) 



6 We denote the calculation of write amplification with TRIM requests in the workload as A ( 
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pdf of Effective Spare Factor at Steady State 
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Fig. 4. Theoretical results for the effective spare factor under various amounts of Trim, with a manufacturer 
specified spare factor of 0.11. When more Trim requests are in the workload, a higher effective spare factor 
is seen, but with a higher variance than seen in a workload with fewer Trim requests. 

in place of p [Frankie et al. 2012b]. This value for p eS can be placed into A Aga rwai to 
obtain 

Agarwal 2 \1 + p - S 

as an expression accounting for Trim requests. An u can also be modified by changing 
the equations in Fig. 6 to 
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"cir 



where u e g = us [Frankie et al. 2012b]. 
5.3. Simulation Results 

A comparison of simulation and theoretical results is found in Fig. 7. ^xing gives a 
very good approximation of the simulated write amplification values. and Agarwal 
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Fig. 5. Theoretical results comparing the effective spare factor to the manufacturer-specified spare factor. 
The mean and 3<r standard deviations are shown. For this figure, a small t = 1280 value was used so that the 
3cr values are visible. Larger gains in the effective spare factor are seen for smaller specified spare factor. In 
this figure, we see that a workload of 10% Trim requests transforms an SSD with a manufacturer-specified 
spare factor of zero into one with an effective spare factor of 0.11. To obtain this spare factor through physical 
overprovisioning alone would require 12.5% more physical space than the user is allowed to access! 

are good approximations for low probability of Trim requests, corresponding to low val- 
ues of /j e ff. However, for larger values of q, corresponding to high values of /j e ff, and 
^Agarwai are optimistic in their predictions of write amplification, with A^ arwal becom- 
ing so optimistic that it predicts an impossible write amplification of less than 1. 

6. HOT AND COLD DATA 

The Trim-modified uniform random workload analyzed above is unlikely to be found 
exactly in real-world applications. Most workloads will have some data that is accessed 
frequently (called hot data), and some data that is accessed infrequently (called cold 
data). Rosenblum and Ousterhout discussed this type of data in their paper [Rosen- 
blum and Ousterhout 1992], suggesting that an appropriate mix of hot and cold data 
is that hot data is requested 90% of the time, but only occupies 10% of the user space, 
and cold data is requested 10% of the time while occupying 90% of the user space. 
Hu et al. [2009] and Hu and Haas [2010] showed through simulation that a standard 
uniform random workload consisting of hot and cold data under the log-structured 
writing and greedy garbage collection described above had high write amplification, 
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Computation of Write Amplification Factor 
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Compute the write amplification, 
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Fig. 6. Equations needed to compute the empirical model-based algorithm for Ajj u (adapted from [Hu et al. 
2009; Hu and Haas 2010] and [Frankie et al. 2012b]). w is the window size, and allows for the windowed 
greedy garbage collection variation of greedy garbage collection. Setting w = -p — r is needed for the 

standard greedy garbage collection discussed in this paper. For an explanation of the theory behind these 
formulas, see [Hu et al. 2009; Hu and Haas 2010]. 



but when the hot data and the cold data were kept in separate blocks, the write am- 
plification improved. In their simulation, the cold data was read-only. We generalize 
the workload to allow cold data to be written and Trimmed infrequently compared to 
the hot data. In this section, we perform analysis of a Trim-modified uniform random 
workload with hot and cold data in which the hot and cold data are kept separate, and 
use simulation to examine the resulting write amplification when hot and cold data 
are mixed together in blocks. We assume that the temperature of the data is perfectly 
known; in reality, this information would need to be learned by the device; learning 
the temperature of the data is beyond the scope of this paper. 
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Write Amplification vs. Probability of TRIM Request 
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Fig. 7. Simulation and theoretical results for write amplification under various amounts of Trim and Sf = 
0.1. The Trim-modified Xiang calculation is a very good match. The Trim-modified calculations for Hu and 
Agarwal are optimistic about the write amplification, with the Agarwal calculation being so optimistic as to 
give an impossible write amplification of less than 1 for high amounts of Trim! 

6.1. Mixed Hot and Cold Data 

In the mixed hot and cold data scenario, no special treatment is given to the data 
based on its temperature; hot and cold data are mixed within blocks. However, when 
all LBAs are not accessed uniformly, as with hot and cold data, the theory already 
described for computing write amplification does not hold. We show this for the mixed 
case of hot and cold data by using the effective overprovisioning to compute the write 
amplification and comparing it with the simulation values. To compute the effective 
overprovisioning, first compute the expected number of In-Use LBAs for hot and cold 
data, the add these to get u e ff- Then, the effective overprovisioning is p e ff = (* — w e ff) / u e & 
and can be plugged directly into the p of Equation 10 to compute the (incorrect) write 
amplification. 

As seen in Figure 8, the theoretical results are not too far off when cold data is up 
to 20% of the user space. However, for larger fractions of cold data in the user space, 
the theoretical results are too optimistic to be of value, and the actual observed write 
amplification grows large. This is undesirable, since cold data is likely to occupy a 
larger fraction of the user space than hot data. 

An analytic solution for the write amplification of mixed hot and cold data is not 
straightforward, and does not easily parallel the derivation of Xiang and Kurkoski 
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Write Amplification For Mixed Hot and Cold Data 
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Fig. 8. Write amplification for mixed hot and cold data. Physical spare factor is 0.2. Trim requests for cold 
data occur with probability 0.1; Trim requests for hot data occur with probability 0.2. Cold data is requested 
10% of the time, with hot data requested the remaining 90% of the time. Notice that the theoretical values, 
based only on the level of effective overprovisioning, are optimistic when cold data accounts for more than 
20% of the user space. 



[2012]. It is known that mixing hot and cold data leads to higher write amplification 
than keeping the data separate Hu et al. [2009] and Hu and Haas [2010], so a math- 
ematical analysis of the scenario is not likely to provide utility beyond mathematical 
interest. Instead, we provide simulation results in Fig. 8 for comparison purposes with 
the separated data scenario. 



6.2. Separated Hot and Cold Data 

Based on the results from Hu et al. [2009] and Hu and Haas [2010], we expect to see 
an improvement in the write amplification over the mixed data case. In the separated 
data scenario, a set of physical blocks are assigned to the cold data, and a separate 
set of physical blocks is assigned to the hot data. Each set of blocks is kept logically 
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separated 7 , and data placement and garbage collection for each temperature operates 
independently, in the same way described in Algorithm 1. 

Assume that h blocks are assigned to the hot data, and c blocks are assigned to the 
cold data, where h + c = t. Also assume that f h fraction of the data is hot, and f c 
fraction of the data is cold (note that fa + f c = 1), so that hot data has uf h LBAs and 
cold data has uf c LBAs. We also allow the hot and cold data to have their own rates of 
Trim, qu and q c , respectively 8 . Finally, the frequency at which requests arrive for hot 
and cold data 9 is denoted as pn and p c , respectively, where ph + p c = 1. Although it 
may take longer for cold data to reach steady-state than it takes for hot data, we can 
compute the expected number of In-Use LBAs for hot and for cold data at steady-state. 
We expect to see ufhSh hot In-Use LBAs, and uf c s c cold In-Use LBAs, with the number 
of valid pages the same as the number of In-Use LBAs for each temperature. From 
this, we can compute the level of effective overprovisioning for each temperature, and 
then the write amplification. 

-( 1 +Phot) 

-(1+Phot) 



MT, hot) Sfe MQ'V 

A ^ - ~ ; ? -(i+p h ot)\ (16} 



W Zii^£l°U e s h 



where phot = h u ff h ■ A similar equation can be computed for A^™ 1 *. Finally, the two 
values are combined in a weighted fashion to obtain the overall write amplification. 
Although it may seem the obvious choice, the weighting to use is not the ratio of hot 
and cold requests to the total number of requests; it is necessary to compute a ratio of 
hot and cold writes to the total number of writes 

ah = — n \~ i n \ (14) 

Ph(l - qh) +Pc(l - q c ) 



Pc(l - Qc 



PhO- - qh) +p c (i - q c ) 



(15) 



Separated = "^Xiang + "^Xiang W 

This analysis points to an optimization problem in how to allocate the overprovi- 
sioned physical blocks on the device after the minimum number of blocks needed for 
each temperature is assigned. For the parameters chosen in graphing the results (see 
Fig. 9), an equal allocation of the overprovisioned physical blocks is optimal. However, 
it would require further exploration to determine whether this is true for all choices of 
parameters of if it is dependent on the parameter values. 

6.3. Simulation Results 

An examination of Fig. 10 shows that physically separating hot and cold data reduces 
the write amplification over allowing the data to be mixed within blocks, unless cold 



7 The complete physical segregation of hot and cold blocks is useful for mathematical analysis. However, 
in an actual device, wear leveling is an issue that needs to be addressed. For this reason, physical blocks 
will occasionally need to be swapped between the hot and cold queues to spread the wear evenly. If this 
is timed to coincide with near-simultaneous garbage collection of hot and cold blocks, the amount of write 
amplification contributed by this wear leveling is expected to be minimal. 

8 Allowing each temperature of data to have its own rate of Trim makes sense, as there is no reason to believe 
that hot and cold data will be Trimmed at the same rate. 

9 We expect requests for hot data to arrive more frequently than requests for cold data, hence the choice of 
names for each. However, the analysis of this section does not require this to the be case. 
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Write Amplification For Hot and Cold Data 
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Fig. 9. Write amplification for separated hot and cold data. Physical spare factor is 0.2. Trim requests for 
cold data occur with probability 0.1; Trim requests for hot data occur with probability 0.2. Cold data is 
requested 10% of the time, with hot data requested the remaining 90% of the time. After allocating the 
minimum number of blocks needed to store all of the user-allowed data for each temperature of data, the 
remaining blocks were split with the fraction shown on the x-axis. In this case, the optimal split was 50-50; 
further investigation of the optimization problem is needed to determine whether this is always the case. 

data occupies only a very small fraction of the user space. However, it is generally 
expected that infrequently accessed (cold) data, will be a large fraction of the data 
stored, so accurate determination of hot versus cold data in order to keep them on 
physically separate blocks is very important. 

7. N TEMPERATURES OF DATA 

The idea of hot and cold data can be extended to N temperatures of data. This can be 
used to model a more complicated workload consisting of N temperatures of data, each 
with its own profile for probability of Trim. Additionally, the idea of N temperatures 
of data can also be thought of as many users who access data at different rates and 
with different amount of Trim in their workloads, which could be helpful in analyzing 
cloud storage workloads. If each temperature of data is kept separate, the analysis of 
separated hot and cold data is easily extended to the case of N temperatures. 

Assume there are N temperatures, with each temperature j e [1, N] having access 
to Uj LBAs. Then, the total number of user LBAs is u = m + u 2 + ■ ■ ■ + u N = J2f=i u r 
Also assume that each temperature has its own probability of Trim qj (and associated 
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Write Amplification For Hot and Cold Data 
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Fig. 1 0. Write amplification for mixed hot and cold data compared to separated hot and cold data. Physical 
spare factor is 0.2. Trim requests for cold data occur with probability 0.1; Trim requests for hot data occur 
with probability 0.2. Cold data is requested 10% of the time, with hot data requested the remaining 90% 
of the time. For the separated hot and cold data, extra blocks were allocated evenly between the two data 
types. The theoretical values for separated data are a good prediction of the simulation values. Note that 
the separated hot and cold data has superior write amplification to the mixed hot and cold data. 

Sj), that is constant over time. Then at steady state, the number of In-Use LBAs from 
user j is X, with a mean oiujSj and a variance of ujSj. 

Define the random variable Y to be the total number of In-Use LBAs. Then Y = 
Y^j=i^i- Because X, are independent Gaussian random variables, their sum, Y, is 

also Gaussian, with mean J2f=i u j s i an & variance Ylf=i u i^i- This information can be 
used to compute the overall effective overprovisioning. However, as demonstrated in 
the hot and cold data section, the overall effective overprovisioning is not useful in 
computing the write amplification of a non-uniform workload. 

The write amplification for N temperatures of data can be computed if the data is 
kept separated, as described in Section 6.2 10 . First, assume each temperature j has 
kj space available for writing (so that J2jLi kj = t) and frequency of writing pj (so that 
Y^j=\Pi = !)■ Then compute the value of pj = kj ~ Uj for each temperature j, and use 



Wear leveling would still need to be taken into consideration in a real device. 
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this to compute the write amplification for each temperature 




) 



At first glance, it may appear that the overall write amplification is just a weighted 
sum of the individual write amplifications, where the weight is Pj. If there are no Trim 
requests, this would be true. However, the Trim requests in the workload need to be 
accounted for. To properly weight the individual temperature write amplifications, we 
need a measure of the ratio of writes requested by each user. Compute the weights 
a j = ^zt^—ttz Then, the overall write amplification is 



8. CONCLUSION 

In this paper, a comprehensive model has been presented for the Trim command in 
a uniform random workload. The steady state number of In-Use LBAs was shown 
to be well approximated as a Gaussian random variable, but the analysis allows for 
inclusion of higher-order terms in order to compute higher order non-Gaussian correc- 
tions to moments such as skewness and kurtosis. The steady-state number of In-Use 
LBAs was used to compute the level of effective overprovisioning, which allowed for 
the adaptation of previous non-Trim models to compute write amplification under a 
log-structured file system utilizing greedy garbage collection. 

The Trim-modified uniform random workload was further extended to include var- 
ious temperatures of data, allowing for closer approximation of real-world workloads. 
The write amplification under this workload with up to N temperatures of data was 
analytically computed under the condition that each temperature of data is stored in a 
physically segregated fashion. For the case of N = 2, hot and cold data, the importance 
of separating different temperatures of data in order to reduce write amplification 
was demonstrated. Additionally, an optimization problem of how to allocate overpro- 
visioned blocks to different temperatures of data was recognized, with the simulation 
result solution for one set of parameters shown. Further work is needed to determine 
the optimal solution for all parameters. 

Although this model is comprehensive, it still has limitations because of the assump- 
tions made. In this paper we used the standard assumption that one LBA requires one 
physical page in order to store the data [Hu et al. 2009; Hu and Haas 2010; Agarwal 
and Marrow 2010; Xiang and Kurkoski 2012], which is equivalent to saying that all 
files stored are the same size. However, in a real system, files are likely to be of vary- 
ing sizes, perhaps requiring even more than one physical block in order to be stored; 
future work should incorporate this idea into the model. 
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